| id | name | color |
|---|---|---|
| 1 | floof | gray |
| 2 | max | black |
| 3 | cat | orange |
| 4 | donut | gray |
| 5 | merlin | black |
| 6 | panda | calico |
Tidy data with tidyr
Day 3 - Introduction to Data Analysis with R
1 What is tidy data?
1.1 What is tidy data?
1.2 What is tidy data?
Let’s look at some examples
Tidy
Non-tidy
| floof | max | cat | donut | merlin | panda |
|---|---|---|---|---|---|
| gray | black | orange | gray | black | calico |
| gray | black | orange | calico |
|---|---|---|---|
| floof | max | cat | panda |
| donut | merlin |
. . .
Sometimes raw data is non-tidy because its structure is optimized for data entry or viewing rather than analysis.
1.3 Why tidy data?
The main advantages of tidy data is that the tidyverse packages are built to work with it.
1.4 Example
Let’s go back to the city data set from earlier:
Expand to reproduce the data
cities <- c("Istanbul", "Moscow", "London", "Saint Petersburg", "Berlin", "Madrid", "Kyiv", "Rome", "Bucharest", "Paris")
population <- c(15.1e6, 12.5e6, 9e6, 5.4e6, 3.8e6, 3.2e6, 3e6, 2.8e6, 2.2e6, 2.1e6)
area_km2 <- c(2576, 2561, 1572, 1439, 891, 604, 839, 1285, 228, 105)
country <- c(
"Turkey", "Russia", "UK", "Russia", "Germany", "Spain",
"Ukraine", "Italy", "Romania", "France"
)
# tidy
cities_tbl <- tibble(
city_name = cities,
population_size = population,
city_area = area_km2,
country = country
)cities_tbl
#> # A tibble: 10 × 4
#> city_name population_size city_area country
#> <chr> <dbl> <dbl> <chr>
#> 1 Istanbul 15100000 2576 Turkey
#> 2 Moscow 12500000 2561 Russia
#> 3 London 9000000 1572 UK
#> 4 Saint Petersburg 5400000 1439 Russia
#> 5 Berlin 3800000 891 Germany
#> 6 Madrid 3200000 604 Spain
#> 7 Kyiv 3000000 839 Ukraine
#> 8 Rome 2800000 1285 Italy
#> 9 Bucharest 2200000 228 Romania
#> 10 Paris 2100000 105 France. . .
This already looks pretty tidy.
1.5 Same data different format
Expand to reproduce the data
cities_untidy <- unite(cities_tbl, col = "location", c(country, city_name)) |>
pivot_longer(c(population_size, city_area), names_to = "type") |>
pivot_wider(names_from = "location", values_from = "value")cities_untidy#> # A tibble: 2 × 11
#> type Turkey_Istanbul Russia_Moscow UK_London Russia_Saint Petersb…¹
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 population_size 15100000 12500000 9000000 5400000
#> 2 city_area 2576 2561 1572 1439
#> Germany_Berlin Spain_Madrid Ukraine_Kyiv Italy_Rome Romania_Bucharest
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3800000 3200000 3000000 2800000 2200000
#> 2 891 604 839 1285 228
#> # ℹ abbreviated name: ¹`Russia_Saint Petersburg`
#> # ℹ 1 more variable: France_Paris <dbl>
What’s not tidy here?
. . .
- Each row has multiple observation
- At the same time, each observation is split across multiple rows
- Country and city variable are split into multiple columns
- Country and city variable values are united to one value
1.6 The tidyr pacakge
Let’s tidy this data using functions from the tidyr package!
. . .
First load the package with either
library(tidyr)or
library(tidyverse)1.7 pivot_longer()
One variable split into multiple columns can be solved with pivot_longer
#> # A tibble: 2 × 11
#> type Turkey_Istanbul Russia_Moscow UK_London Russia_Saint Petersb…¹
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 population_size 15100000 12500000 9000000 5400000
#> 2 city_area 2576 2561 1572 1439
#> Germany_Berlin Spain_Madrid Ukraine_Kyiv Italy_Rome Romania_Bucharest
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3800000 3200000 3000000 2800000 2200000
#> 2 891 604 839 1285 228
#> # ℹ abbreviated name: ¹`Russia_Saint Petersburg`
#> # ℹ 1 more variable: France_Paris <dbl>
1.8 pivot_longer()
One variable split into multiple columns can be solved with pivot_longer
step1 <- pivot_longer(
cities_untidy, # the tibble
cols = Turkey_Istanbul:France_Paris, # the columns to pivot from:to
names_to = "location", # name of the new column
values_to = "value"
) # name of the value column. . .
#> # A tibble: 20 × 3
#> type location value
#> <chr> <chr> <dbl>
#> 1 population_size Turkey_Istanbul 15100000
#> 2 population_size Russia_Moscow 12500000
#> 3 population_size UK_London 9000000
#> 4 population_size Russia_Saint Petersburg 5400000
#> # ℹ 16 more rows
1.9 pivot_longer()
One variable split into multiple columns can be solved with pivot_longer
step1 <- pivot_longer(
cities_untidy, # the tibble
cols = Turkey_Istanbul:France_Paris, # the columns to pivot from:to
names_to = "location", # name of the new column
values_to = "value"
) # name of the value columnAnother way to select the columns to pivot:
1.10 separate_wider_delim()
Multiple variable values that are united into one can be separated using separate_wider_delim
#> # A tibble: 20 × 3
#> type location value
#> <chr> <chr> <dbl>
#> 1 population_size Turkey_Istanbul 15100000
#> 2 population_size Russia_Moscow 12500000
#> # ℹ 18 more rows
. . .
step2 <- separate_wider_delim(
step1, # the tibble
location, # the column to separate
delim = "_", # the separator
names = c("country", "city_name")
) # names of new columns. . .
#> # A tibble: 20 × 4
#> type country city_name value
#> <chr> <chr> <chr> <dbl>
#> 1 population_size Turkey Istanbul 15100000
#> 2 population_size Russia Moscow 12500000
#> # ℹ 18 more rows
. . .
The opposite function exists as well and is called unite. Check out ?unite for details.
1.11 pivot_wider()
One observation split into multiple rows can solved with pivot_wider
#> # A tibble: 20 × 4
#> type country city_name value
#> <chr> <chr> <chr> <dbl>
#> 1 population_size Turkey Istanbul 15100000
#> 2 population_size Russia Moscow 12500000
#> # ℹ 18 more rows
. . .
step3 <- pivot_wider(
step2, # the tibble
names_from = type, # the variables
values_from = value
) # the values. . .
#> # A tibble: 10 × 4
#> country city_name population_size city_area
#> <chr> <chr> <dbl> <dbl>
#> 1 Turkey Istanbul 15100000 2576
#> 2 Russia Moscow 12500000 2561
#> 3 UK London 9000000 1572
#> 4 Russia Saint Petersburg 5400000 1439
#> 5 Germany Berlin 3800000 891
#> # ℹ 5 more rows
1.12 All steps in 1
We can also use a pipe to do all these steps in one:
cities_tidy <- cities_untidy |>
pivot_longer(
Turkey_Istanbul:France_Paris,
names_to = "location",
values_to = "values"
) |>
separate_wider_delim(
location,
delim = "_",
names = c("country", "city_name")
) |>
pivot_wider(
names_from = type,
values_from = values
)1.13 Remove missing values with drop_na()
Drop rows with missing values:
# drop rows with missing values in any column
drop_na(and_vertebrates)
# drop rows with missing values in weight column
drop_na(and_vertebrates, weight_g)
# drop rows with missing values in weight and species columns
drop_na(and_vertebrates, weight_g, species). . .
This is an easier and more intuitive alternative to filter(!is.na(...)).
2 Now you
Task (30 min)
Tidy data with tidyr
Find the task description here